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(N 

O ■ 1. Introduction 

(N 

Q^' The expectation-maximization (EM) algorithm introduced by Dempster et al [12] in 1977 is a 

very general method to solve maximum likelihood estimation problems. In this informal report, 
we review the theory behind EM as well as a number of EM variants, suggesting that beyond 
the current state of the art is an even much wider territory still to be discovered. 



^ ; 2. EM background 

I Let Y a random variable with probability density function (pdf) p{y\0), where 6 is an unknown 

c/2 ' parameter vector. Given an outcome y of Y, we aim at maximizing the likelihood function 

C{9) = p{y\9) wrt 9 over a given search space G. This is the very principle of maximum 
CN ! likelihood (ML) estimation. Unfortunately, except in not very exciting situations such as, e.g. 

estimating the mean and variance of a Gaussian population, a ML estimation problem has 
■ generally no closed- form solution. Numerical routines are then needed to approximate it. 

2.1. EM as a likelihood maximizer 

in 

' The EM algorithm is a class of optimizers specifically taylored to ML problems, which makes 

it both general and not so general. Perhaps the most salient feature of EM is that it works 
iteratively by maximizing successive local approximations of the likelihood function. Therefore, 
each iteration consists of two steps: one that performs the approximation (the E-step) and one 
^ ' that maximizes it (the M-step). But, let's make it clear, not any two-step iterative scheme is an 

, EM algorithm. For instance, Newton and quasi-Newton methods [27] work in a similar iterative 

fashion but do not have much to do with EM. What essentially defines an EM algorithm is the 
philosophy underlying the local approximation scheme - which, in particular, doesn't rely on 
differential calculus. 

The key idea underlying EM is to introduce a latent variable Z whose pdf depends on 9 with 
the property that maximizing p{z\9) is easy or, say, easier than maximizing p{y\9). Loosely 
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speaking, we somehow enhance the incomplete data by guessing some useful additional infor- 
mation. Technically, Z can be any variable such that — t- Z — > 1" is a Markov chain^, i.e. we 
assume that 'p{v\z^d^ is independent from 0, yielding a Chapman-Kolmogorov equation: 

v{^z,y\Q)=v{.z\e)v{y\z) (1) 

Reasons for that definition will arise soon. Conceptually, Z is a complete-data space in the 
sense that, if it were fully observed, then estimating Q would be an easy game. We will emphasize 
that the convergence speed of EM is highly dependent upon the complete-data specification, 
which is widely arbitrary despite some estimation problems may have seamingly "natural" 
hidden variables. But, for the time being, we assume that the complete-data specification step 
has been accomplished. 



2.2. EM as a consequence of Jensen's inequality 

Quite surprisingly, the original EM formulation stems from a very simple variational argument. 
Under almost no assumption regarding the complete variable Z, except its pdf doesn't vanish 
to zero, we can bound the variation of the log-likelihood function L{Q) = logp{y\6) as follows: 

m-m ^ 10.^ 



log 



f Piz,y\9) . I 



= log J ^^p{z\y,e')dz (2) 

' 

Call this Q{e,e') 

Step (2) results from the fact that p{y\z,9) is independent from owing to (1). Step (3) 
follows from Jensen's inequality (see [9] and appendix A. 2) along with the well-known con- 
cavity property of the logarithm function. Therefore, Q{9,9') is an auxiliary function for the 
log-likelihood, in the sense that: (i) the likelihood variation from 9' to 9 is always greater than 
Q{9, 9'), and (ii) we have Q{9', 9') = 0. Hence, starting from an initial guess 9' , we are guaran- 
teed to increase the likelihood value if we can find a 9 such that Q{9, 9') > 0. Iterating such a 
process defines an EM algorithm. 

There is no general convergence theorem for EM, but thanks to the above mentioned mono- 
tonicity property, convergence results may be proved under mild regularity conditions. Typi- 
cally, convergence towards a non-global likelihood maximizcr, or a saddle point, is a worst-case 
scenario. Still, the only trick behind EM is to exploit the concavity of the logarithm function! 



^In many presentations of EM, Z is as an aggregate variable {X, Y), where X is some "missing" data, which cor- 
responds to the special case where the transition Z Y is deterministic. We believe this restriction, although 
important in practice, is not useful to the global understanding of EM. By the way, further generalizations 
will be considered later in this report. 
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2.3. EM as expecation-maximization 



Let's introduce some notations. Developing the logarithm in the right-hand side of (3), we may 
interpret our auxiliary function as a difference: Q{9,9') = Q{6\9') — Q{9'\6'), with: 



Clearly, for a fixed 6', maximizing Q{9,9') wrt 9 is equivalent to maximizing Q{9\9'). If we 
consider the residual function: R{9\9') = L{9) — Q{9\9'), the incomplete-data log-likelihood may 
be written as: 



The EM algorithm's basic principle is to replace the maximization of L{9) with that of Q{9\9'), 
hopefully easier to deal with. We can ignore R{6\9') because inequality (3) implies that R{9\9') > 
R{9'\9'). In other words, EM works because the auxiliary function Q{9\9') always deteriorates 
as a likelihood approximation when 9 departs from 9'. In an ideal world, the approximation 
error would be constant; then, maximizing Q would, not only increase, but truly maximize the 
likelihood. Unfortunately, this won't be the case in general. Therefore, unless we decide to 
give up on maximizing the likelihood, we have to iterate - which gives rise to quite a popular 
statistical learning algorithm. 

Given a current parameter estimate 9n- 

• E-step: form the auxiliary function Q{9\9n) as defined in (4), which involves computing 
the posterior distribution of the unobserved variable, p{z\y, 9n)- The "E" in E-step stands 
for "expectation" for reasons that will arise in section 2.4. 

• M-step: update the parameter estimate by maximizing the auxiliary function: 



An obvious but important generalization of the M-step is to replace the maximization 
with a mere increase of Q{9\9n)- Since, anyway, the likelihood won't be maximized in 
one iteration, increasing the auxiliary function is enough to ensure that the likelihood will 
increase in turn, thus preserving the monotonicity property of EM. This defines generalized 
EM (GEM) algorithms. More on this later. 

2.4. Some probabilistic interpretations here... 

For those familiar with probability theory, Q{9\9') as defined in (4) is nothing but the conditional 
expectation of the complete-data log-likelihood in terms of the observed variable, taken aXY = y, 
and assuming the true parameter value is 9': 



This remark explains the "E" in E-step, but also yields some probabilistic insight on the 
auxiliary function. For all 9, Q{9\9') is an estimate of the the complete-data log- likelihood that 
is built upon the knowledge of the incomplete data and under the working assumption that the 
true parameter value is known. In some way, it is not far from being the "best" estimate that 




(4) 



L{9) = Q{9\9') + R{9\9') 



9n+i = argmaxQ{9\9n) 



Qi9\9')^E[\ogpiZ\9)\y,9'] 



(5) 
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we can possibly make without knowing Z, because conditional expectation is, by definition, the 
estimator that minimizes the conditional mean squared error^. 

Having said that, we might still be a bit suspiscious. While we can grant that Q{9\9') is 
a reasonable estimate of the complete-data log-likelihood, recall that our initial problem is to 
maximize the incom,plete-data (log) likelihood. How good a fit is Q(9\6') for L{9)1 To answer 
that, let's see a bit more how the residual R{9\9') may be interpreted. We have: 



where the last step relies on Bayes' law. Now, p{y\z,9) = p{y\z) is independent from 9 by the 
Markov property (1). Therefore, using the simplified notations qe{z) = p{z\y,9) and qe'{z) = 
PiAv^O')^ we get: 



In the language of information theory, this quantity D{q0i\\qQ) is known as the Kullback- 
Leibler distance, a general tool to assess the deviation between two pdfs [9]. Although it is not, 

strictly speaking, a genuine mathematical distance, it is always positive and vanishes iff the 
pdfs are equal which, again and not surprinsingly, comes as a direct consequence of Jensen's 
inequality. 

What does that mean in our case? We noticed earlier that the likelihood approxima- 
tion Q{9\9') cannot get any better as 9 deviates from 9' . We now realize from equation (7) 
that this property reflects an implicit strategy of ignoring the variations of p{z\y^9) wrt 9. 
Hence, a perfect approximation would be one for which p{z\y,9) is independent from 9. In 
other words, we would like 9 ^ Y ^ Z to define a Markov chain... But, look, we already 
assumed that 6* — > Z — > y is a Markov chain. Does the Markov property still hold when 
permuting the roles of Y and Z? 

From the fundamental data processing inequality [9], the answer is no in general. Details 
are unnecessary here. Just remember that the validity domain of Q{9\9') as a local likelihood 
approximation is controlled by the amount of information that both y and 9 contain about the 
complete data. We are now going to study this aspect more carefully. 

2.5. EM as a fix point algorithm and local convergence 

Quite clearly, EM is a fix point algorithm: 

9n+i = ^{9n) with ^{9') = argmaxQ(^, 9') 

Assume the sequence 9n converges towards some value 9 - hopefully, the maximum likelihood 
estimate but possibly some other local maximum or saddle point. Under the assumption that 
$ is continuous, 9 must be a fix point for i.e. 9 = ^{9). Furthermore, we can approximate 




(6) 




(7) 



Call this D{qgi\\q0) 



^For all 9, we have: Q{9\6') = argmin 




4 



the sequence's asymptotic behavior using a first-order Taylor expansion of $ around 6, which 
leads to: 

en+i ^ se + {I - s)en with s = i-^\^ 

This expression shows that the rate of convergence is controlled by S, a square matrix that is 
constant across iterations. Hence, S is called the speed matrix, and its spectral radius^ defines 
the global speed. Unless the global speed is one, the local convergence of EM is only linear. We 
may relate S to the likelihood function by exploiting the fact that, under sufficient smoothness 
assumptions, the maximization of Q is characterized by: 

■^{0n+l,0n) = 

From the implicit function theorem, we get the gradient of $: 



^ / d^Q \-i d^Q / O^Q \-ir d^Q d^Q 

do ~ \dedd*) 80' de^ ^ ~ \d0d0t) 18080* 80' 80* 



where, after some manipulations: 

8'^Q . f . . ^. d'^logp{z\0). 



•' " . ' 

Call this -Jz(^) 

8'^Q . d'^\ogv[y\0) d'^Q . 



dz 



QQ,QQtK6fi) d0d0* 

^ V ' 

Call this -Jy{e) 

The two quantities Jy{0) and Jz{0) turn out to be respectively the observed-data information 
matrix and the complete-data information matrix. The speed matrix is thus given by: 

S = JM-^Jy{0) with J,{0) = ^J,{0)\y,0] (8) 

We easily check that: Jz{0) = Jy{0) + ^z\y{^)i where J^z\y{^) is the Fisher information matrix 
corresponding to the posterior pdf which is always symmetric positive. Therefore, we 

have the alternative expression: 

S=[Jy{0)+Fz\ym-'jy{0) 

For fast convergence, we want 5 close to identity, so we better have the posterior Fisher 
matrix as "small" as possible. To interpret this result, let's imagine that Z is drawn from 
p{z\y,0), which is not exactly true since may be at least slightly different from the actual 
parameter value. The Fisher information matrix represents the average information that the 
complete data contains about 9 conditional to the observed data. In this context, Tz\y{0) is 
a measure of missing information, and the speed matrix is the fraction of missing data. The 
conclusion is that the rate of convergence of EM is governed by the fraction of missing data. 



^Let (Ai, A2, . . . , Am) be the complex eigenvalues of S. The spectral radius is p(5) = mini |Af | 
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2.6. EM as a proximal point algorithm 

Chretien k, Hero [7] note that EM may also be interpreted as a proximal point algorithm, i.e. 
an iterative scheme of the form: 

9n+i = argmax [L{B) - A„*(^, 9n)], (9) 


where ^ is some positive regularization function and A„ is a sequence of positive numbers. 

Let us see where this result comes from. In section 2.4, we have established the fundamental 
log-likelihood decomposition underlying EM, L{9) = Q{9\9') + R{9\9') , and related the variation 
of R{9\9') to a KuUback distance (7). Thus, for some current estimate we can write: 

Q{9\9n) = L{9) - D{qeAqe) - R{9n\9n), 

where qe{z) = p{z\y,9) and qe„{^) = p{^\y^^n) are the posterior pdfs of the complete data, 
under 9 and 9n, respectively. From this equation, it becomes clear that maximizing Q{9\9n) is 
equivalent to an update rule of the form (9) with: 

^{9,9n)=D{qgJqe), A„ = 1 

The proximal interpretation of EM is very useful to derive general convergence results [7]. 

In particular, the convergence rate may be superlinear if the sequence A,„ is chosen so as to 
converge towards zero. Unfortunately, such generalizations are usually intractable because the 
objective function may no longer simplify as soon as A„ 7^ 1. 

2.7. EM as maximization-maximization 

Another powerful way of conceptualizing EM is to reinterpret the E-step as another maxi- 
mization. This idea, which was formalized only recently by Neal & Hinton [26], appears as a 
breakthrough in the general understanding of EM-type procedures. Let us consider the following 
function: 

L{9,q)=¥.^[\ogp{Z,y\9)]+H{q) (10) 

where ^(z) is some pdf (yes, any pdf), and is its entropy [9], i.e. H{q) = — J \ogq{z) q{z) dz. 
We easily obtain an equivalent expression that involves a KuUback-Leiber distance: 

L{9,q)=L{9)-D{q\\qe), 

where we still define q0{z) = p{z\y,9) for notational convenience. The last equation reminds 
us immediately of the proximal interpretation of EM which was briefly discussed in section 2.6. 
The main difference here is that we don't impose q{z) = p{z\y, 9) for some 9. Equality holds 
for any distribution! 

Assume we have an initial guess 9n and try to find q that maximizes L{9n,q)- From the 
above discussed properties of the KuUback-Leibler distance, the answer is q{z) = qQ^(z). Now, 
substitute qo^ in (10), and maximize over 9: this is the same as performing a standard M-step^! 
Hence, the conventional EM algorithm boils down to an alternate maximization of L{9, q) over 
a search space O x Q, where Q is a suitable set of pdfs, i.e. Q must include all pdfs from the 
set {q{z) = p{z\y,9), 9 € 0}. It is easy to check that any global maximizer (9,q) of L{9,q) is 



*To sec that, just remember that p{z, y\6) = p{z\9)p{y\z) where p{y\z) is independent from 6 due to the Markov 
property (1). 
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such that 9 is also a global maximizer of L{9). By the way, this is also true for local maximizers 

under weak assumptions [26]. 

The key observation of Neal &; Hinton is that the alternate scheme underlying EM may 
be replaced with other maximization strategies without hampering the simplicity of EM. In 
the conventional EM setting, the auxiliary pdf qn{z) is always constrained to a specific form. 
This is to say that EM authorizes only specific pathways in the expanded search space Q x Q, 
yielding some kind of "labyrinth" motion. Easier techniques to find its way in a labyrinth 
include breaking the walls or escaping through the roof. Similarly, one may consider relaxing 
the maximization constraint in the E-step. This leads for instance to incremental and sparse 
EM variants (see section 3). 

3. Deterministic EM variants 

We first present deterministic EM variants as opposed to stochastic variants. Most of these 
deterministic variants attempt at speeding up the algorithm, either by simplifying computations, 
or by increasing the rate of convergence (see section 2.5). 

3.1. CEM 

Classification EM [6]. The whole EM story is about introducing a latent variable Z and per- 
forming some inference about its posterior pdf. We might wonder: why not simply estimate Zl 
This is actually the idea underlying the CEM algorithm, which is a simple alternate maximiza- 
tion of the functional p{z,y\9) wrt both 9 and z. Given a current parameter estimate 9n, this 
leads to: 

• Classification step: find z„ = argmaxp(2;|y, 

2 

• Maximization step: find 9n+i = argmaxp(2;„|^). 

6 

Notice that a special instanciation of CEM is the well-known fc-means algorithm. In prac- 
tice, CEM has several advantages over EM, like being easier to implement and typically faster 
to converge. However, CEM doesn't maximize the incomplete-data likelihood and, therefore, 
the monotonicity property of EM is lost. While CEM estimates the complete data explicitely, 
EM estimates only sufficient statistics for the complete data. In this regard, EM may be under- 
stood as a fuzzy classifier that avoids the statistical efficiency problems inherent to the CEM 
approach. Yet, CEM is often useful in practice. 

3.2. Aitken's acceleration 

An early EM extension [12, 21, 22]. Aitken's acceleration is a general purpose technique to 
speed up the convergence of a fixed point recursion with asymptotic linear behavior. Section 2.5 
established that, under appropriate smoothness assumptions, EM may be approximated by a 
recursion of the form: 

9n+l ^ S9 + (I — S)9n, 

where 9 is the unknown limit and S is the speed matrix given by (8) which depends on this 
limit. Aitken's acceleration stems from the remark that, if S was known, then the limit could 

be computed explicitely in a single iteration, namely: 9 ^ 9q + S^^{6i — 9q) for some starting 
value 9q. Despite that S is unknown and the sequence is not strictly linear, we are tempted to 
consider the following modified EM scheme. Given a current parameter estimate 9n, 
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• E-step: compute (5(^|0„) and approximate the inverse speed matrix: = Jy(On) Jz{Gn)- 

• M-step: unchanged, get an intermediate value 0* = aig max. Q{0\6n)- 



• Acceleration step: update the parameter using 9n+i = dn + Sn^{9* — 9n)- 

It turns out that this scheme is nothing but the Newton-Raphson method to find a zero of 
6 t— > $(0) — 9, where $ is the map defined by the EM sequence, i.e. ^{9') = argm.sx.Q Q{9\9'). 
Since the standard EM sets 5„ = / on each iteration, it may be viewed as a first-order approach 
to the same zero-crossing problem, hence avoiding the expense of computing Sn- Beside this 
important implementational issue, convergence is problematic using Aitken's acceleration as the 
monotonicity property of EM is generally lost. 

3.3. AEM 

Accelerated EM [15]. To trade off between EM and its Aitken's accelerated version (see sec- 
tion 3.2), Jamshidian and Jennrich propose a conjugate gradient approach. Don't be messed 
up: this is not a traditional gradient-based method (otherwise there would be no point to talk 
about it in this report). No gradient computation is actually involved in here. The "gradient" 
is the function ^{9) — 9, which may be viewed as a generalized gradient for the incomplete- 
data log-likelihood, hence justifying the use of the generalized conjugate gradient method (see 
e.g. [27]). Compared to the Aitken's accelerated EM, the resulting AEM algorithm doesn't 
require computing the speed matrix. Instead, the parameter update rule is of the form: 

9n+l = 9n + Xndji, 

where d„ is a direction composed from the current direction $(^„) — 9n and the previous di- 
rections (the essence of conjugate gradient), and is a step size typically computed from a 
line maximization of the complete-data likelihood (which may or may not be cumbersome). As 
an advantage of line maximizations, the monotonicity property of EM is safe. Also, from this 
generalized gradient perspective, it is straightforward to devise EM extensions that make use of 
other gradient descent techniques such as the steepest descent or quasi-Newton methods [27]. 

3.4. ECM 

Expectation Conditional Maximization [23]. This variant (not to be confused with CEM, see 
above) was introduced to cope with situations where the standard M-step is intractable. It is 
the first on a list of coordinate ascent-based EM extensions. 

In ECM, the M-step is replaced with a number of lower dimensional maximization problems 
called CM-steps. This implies decomposing the parameter space as a sum of subspaces, which, 
up to some possible reparameterization, is the same as splitting the parameter vector into 
several blocks, 9 = {ti,t2, ■ ■ ■ ,ts). Starting from a current estimate 9n, the CM-steps update 
one coordinate block after another by partially maximizing the auxiliary Q-function, yielding a 
scheme similar in essence to Powell's multidimensional optimization method [27]. This produces 
a sequence 9n = 9n,o ^n,i ^ 9n,2 ■ ■ ■ ^ 9n,s-i 9n,s = 9n+i, such that: 

Q{9n\9n) < Q{9n,l\9n) < Q{9n,2\9n) < ■ ■ ■ < Q{9n,s-l\9n) < Q{9n+l\9n) 

Therefore, the auxiliary function is guaranteed to increase on each CM-step, hence globally 
in the M-step, and so does the incomplete-data likelihood. Hence, ECM is a special case of 
GEM (see section 2.3). 



8 



3.5. ECME 



ECM either [18]. This is an extension of ECM where some CM-steps are replaced with steps 
that maximize, or increase, the incomplete-data log-likelihood L{9) rather than the auxiliary 
Q-function. To make sure that the likelihood function increases globally in the M-stcp, the 
only requirement is that the CM-steps that act on the actual log-likelihood be performed after 
the usual Q-maximizations. This is because increasing the Q-function only increases likelihood 
from the starting point, namely 6n, which is held fixed during the M-step (at least, this is what 
we assume)^. 

Starting with Q-maximizations is guaranteed to increase the likelihood, and of course sub- 
sequent likelihood maximizations can only improve the situation. With the correct setting, 
ECME is even more general than GEM as defined in section 2.3 while inheriting its funda- 
mental monotonicity property. An example application of ECME is in mixture models, where 
typically mixing proportions are updated using a one-step Newton-Raphson gradient descent on 
the incomplete-data likelihood, leading to a simple additive correction to the usual EM update 
rule [18]. At least in this case, ECME has proved to converge faster than standard EM. 

3.6. SAGE 

Space- Alternating Generalized EM [13, 14]. In the continuity of ECM and ECME (see sec- 
tions 3.4 and 3.5), one can imagine defining an auxiliary function specific to each coordinate 
block of the parameter vector. More technically, using a block decomposition 6 = {ti,t2, ■ ■ ■ ,ts), 
we assume that, for each block i = l,...,s, there exists a function Qi{9\9') such that, for 
all 9 and 9' with identical block coordinates except (maybe) for the i-th block, we have: 
L{9) - L{9') > Qi{9\9') - Q^{9'\9'). 

This idea has two important implications. First, the usual ECM scheme needs to be rephrased, 
because changing the auxiliary function across CM-steps may well result in decreasing the 
likelihood, a problem worked around in ECME with an appropriate ordering of the CM-steps. In 
this more general framework, though, there may be no such fix to save the day. In order to ensure 
monotonicity, at least some CM-steps should start with "reinitializing" their corresponding 
auxiliary function, which means... performing an E-step. It is important to realize that, because 
the auxiliary function is coordinate-specific, so is the E-step. Hence, each "CM-step" becomes 
an EM algorithm in itself which is sometimes called a "cycle". We end up with a nested 
algorithm where cycles are embedded in a higher-level iterative scheme. 

Furthermore, how to define the Qi's? From section 2, wc know that the standard EM auxiliary 
function Q{9\9') is built from the complete-data space Z; sec in particular equation (5). Fessler 
&; Herro introduce hidden-data spaces, a concept that generalizes complete-data spaces in the 
sense that hidden-data spaces may be coordinate-specific, i.e. there is a hidden variable Zi for 
each block ti. Formally, given a block decomposition 9 = {ti,t2, ■ ■ ■ ,ts), Zi is a hidden-data 
space for ti if: 

V{zi,y\9) = p{y\zi, {tj^i}) p{zi\9) 

This definition's main feature is that the conditional probability of Y knowing Zi is allowed 
to be dependent on every parameter block but U. Let us check that the resulting auxiliary 

^ For example, if one chooses 9* such that L{0*) > L{6n) and, then, On+i such that Q{6n+i\dn) > Q{d*\9n)-, 
the only conlusion is that the likelihood increases from 0„ to 9* , but may actually decrease from 6* to 6„+i 
because 6* is not the starting point of Q. Permuting the L-maximization and the Q-maximization, we 
have Q(9''\9n) > Q{On\Gn), thus L(9*) > L{6n), and therefore L(9„+i) > L{9n) since we have assumed 
L{9n+i) > L{9*). This argument generalizes easily to any intermediate sequence using the same cascade 
inequalities as in the derivation of ECM (see section 3.4). 
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function fulfils the monotonicity condition. We define: 



QMO')^E[logp{Z,\e)\y,e'] 



Then, applying Jensen's inequality (3) like in section 2.2, we get: 



L{e) - Lie') > Qiie\e') - Qi{e'\e') + 



log 



p{y\zi,o) 
p{y\zi,0') 



p{zi\y,e') dzi 



When 6 and 6' differ only by tj, the integral vanishes because the conditional pdi p{y\zi, 6) 
is independent from tj by the above definition. Consequently, maximizing Qi{0\6') with respect 
to ti only (the other parameters being held fixed) is guaranteed to increase the incomplete-data 
likelihood. Specific applications of SAGE to the Poisson imaging model or penalized least- 
squares regression were reported to converge much faster than standard EM. 



Component- wise EM for Mixtures [4]. Celcux et al extend the SAGE methodology to the case 
of constrained likelihood maximization, which arises typically in mixture problems where the 
sum of mixing proportions should equate to one. Using a Lagrangian dualization approach, they 
recast the initial problem into unconstrained maximization by defining an appropriate penalized 
log- likelihood function. The CEMM algorithm they derive is a natural coordinatewisc variant 
of EM whose convergence to a stationary point of the likelihood is established under mild 
regularity conditions. 



Alternating ECM [24, 25]. In an attempt to summarize earlier contributions, Meng & van 
Dyk propose to cast a number of EM extensions into a unified framework, the so-called AECM 
algorithm. Essentially, AECM is a SAGE algorithm (itself a generalization of both ECM and 
ECME) that includes another data augmentation trick. The idea is to consider a family of 
complete-data spaces indexed by a working parameter a. More formally, we define a joint pdf 
q{z,y\6,a) as depending on both 9 and a, yet imposing the constraint that the corresponding 
marginal incomplete-data pdf be preserved: 



and thus independent from a. In other words, a is identifiable only given the complete data. 
A simple way of achieving such data augmentation is to define Z = /^^^(Zo), where Zq is some 
reference complete-data space and f^^a is a one-to-one mapping for any (6, a). Interestingly, it 
can be seen that a has no effect if fe^a is insensitive to 9. In AECM, a is tuned beforehand so 
as that to minimize the fraction of missing data (8), thereby maximizing the algorithm's global 
speed. In general, however, this initial mimimization cannot be performed exactly since the 
global speed may depend on the unknown maximum likelihood parameter. 



Parameter- Expanded EM [19, 17]. Liu et al revisit the working parameter method suggested 
by Meng and van Dyk [24] (sec section 3.8) from a slighlty different angle. In their strategy, the 
joint pdf q{z, y\9, a) is defined so as to meet the two following requirements. First, the baseline 



3.7. CEMM 



3.8. AECM 




3.9. PX-EM 



10 



model is embedded in the expanded model in the sense that q{z,y\9,aQ) = p{z,y\6) for some 
null value uq. Second, which is the main difference with AECM, the observed-data marginals 
are consistent up to a many-to-one reduction function r{6,a), 



for all (6, a). From there, the trick is to to "pretend" estimating a iteratively rather than 
pre-processing its value. 

The PX-EM algorithm is simply an EM on the expanded model with additional instructions 
after the M-step to apply the reduction function and reset a to its null value. Thus, given 
a current estimate 9n, the E-step forms the auxiliary function corresponding to the expanded 
model from {9n, ao), which amounts to the standard E-step because a = ckq. The M-step then 
provides {9*, a*) such that q{y\9*,a*) > q{y\9n,ao), and the additional reduction step updates 
9n according to 6n+i = r{9*,a*), implying p{y\dn+i) = q{_y\d*,a*). Because q{y\9n,ao) = 
p{y\9n) by construction of the expanded model, we conclude that p(y\9n+i) > p{y\Qn)i which 
shows that PX-EM preserves the monotonicity property of EM. 

In some way, PX-EM capitalizes on the fact that a large deviation between the estimate of a 
and its known value ckq is an indication that the parameter of interest 6 is poorly estimated. 
Hence, PX-EM adjusts the M-step for this deviation via the reduction function. A variety of 
examples where PX-EM converges much faster than EM is reported in [19]. Possible variants 
of PX-EM include the coordinatewise extensions underlying SAGE. 

3.10. Incremental EM 

Following the maximization-maximization approach discussed in section 2.7, Neal k. Hinton [26] 

address the common case where observations are i.i.d. Then, we have p{y\9) = Y\iPiyi\9) and, 
similarly, the global EM objective function (10) reduces to: 



where we can search for q under the factored form q{z) = Hi^il^)- Therefore, for a given 9, 
maximizing L{9,q) wrt q is equivalent to maximizing the contribution of each data item wrt qi, 
hence splitting the global maximization problem into a number of simpler maximizations. Incre- 
mental EM variants are justified from this remark, the general idea being to update 9 by visiting 
the data items sequencially rather than from a global E-step. Neal Sz Hinton demonstrate an 
incremental EM variant for mixtures that converges twice as fast as standard EM. 



Another suggestion of Neal &: Hinton [26] is to track the auxiliary distribution q{z) in a subspace 
of the original search space Q (at least for a certain number of iterations). This general strategy 
includes sparse EM variants where q is updated only at pre-defined plausible unobserved values. 
Alternatively, "winner-take- all" EM variants such as the GEM algorithm [6] (sec section 3.1) 
may be seen in this light. Such procedures may have strong computational advantages but, 
in counterpart, are prone to estimation bias. In the maximization-maximization interpretation 
of EM, this comes as no surprise since these approaches "project" the estimate on a reduced 
search space that may not contain the maximum likelihood solution. 




L{9, q)=J2 {EgJ logp(^i, yi\9)] + H{qi)}, 



3.11. Sparse EM 
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4. Stochastic EM variants 



While deterministic EM variants were mainly motivated by convergence speed considerations, 
stochastic variants are more concerned with other limitations of standard EM. One is that the 
EM auxiliary function (4) involves computing an integral that may not be tractable in some 
situations. The idea is then to replace this tedious computation with a stochastic simulation. 
As a typical side effect of such an approach, the modified algorithm inherits a lesser tendancy 
to getting trapped in local maxima, yielding improved global convergence properties. 



4.1. SEM 

Stochastic EM [5]. As noted in section 2.4, the standard EM auxiliary function is the best 
estimate of the complete-data log-likelihood in the sense of the conditional mean squared error. 
The idea underlying SEM, like other stochastic EM variants, is that there might be no need to 
ask for such a "good" estimate. Therefore, SEM replaces the standard auxiliary function with: 

Q{e\e') = \ogp{z'\e'), 

where z' is a random sample drawn from the posterior distribution of the unobserved variable^, 
p{z\y,9'). This leads to the following modified iteration; given a current estimate 9^- 

• Simulation step: compute p{z\y,9n) and draw an unobserved sample Zn from 

• Maximization step: find 9n+i = argmaxp(2:„|0). 



By construction, the resulting sequence 9n is an homogeneous Markov chain'' which, under 
mild regularity conditions, converges to a stationary pdf. This means in particular that 9n 
doesn't converge to a unique value! Various schemes can be used to derive a pointwise limit, 
such as averaging the estimates over iterations once stationarity has been reached (see also 
SAEM regarding this issue). It was established in some specific cases that the stationary pdf 
concentrates around the likelihood maximizer with a variance inversly proportional to the sam- 
ple size. However, in cases where several local maximizers exist, one may expect a multimodal 
behavior. 



4.2. DA 

Data Augmentation algorithm [28]. Going further into the world of random samples, one may 
consider replacing the M-step in SEM with yet another random draw. In a Bayesian context, 
maximizing p{zn\9) wrt 9 may be thought of as computing the mode of the posterior distribution 
P{0\zn), given by: 

where we can assume a flat (or non-informative) prior distribution for 9. In DA, this maximiza- 
tion is replaced with a random draw 9n+i ~ p{9\zn)- From equation (1), we easily check that 
p{9\zn) = p{9\zn,y)- Therefore, DA alternates conditional draws Zn\{9n,y) and ^n+iK-^n, y), 
which is the very principle of a Gibbs sampler. Results from Gibbs sampling theory apply, and 



^Notice that when Z is defined as Z = {X, Y), this simulation reduces to a random draw of the missing data X. 
^The draws need to be mutually independent conditional to (^i, O2, ■ ■ ■ , On), i.e. p{zi, 02, ■ ■ ■ , Zn\0i,02, ■ ■ ■ , On) = 
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it is shown under general conditions that the sequence On is a Markov chain that converges 
in distribution towards p{0\y). Once the sequence has reached stationarity, averaging 0^ over 
iterations yields a random variable that converges to the conditional mean E(^|y), which is an 
estimator of generally different from the maximum likelihood but not necessarily worse. 

Interesting enough, several variants of DA have been proposed recently [20, 17] following the 
parameter expansion strategy underlying the PX-EM algorithm described in section 3.9. 

4.3. SAEM 

Stochastic Approximation type EM [3]. The SAEM algorithm is a simple hybridation of EM 
and SEM that provides a pointwise convergence as opposed to the erratic behavior of SEM. 
Given a current estimate On, SAEM performs a standard EM iteration in addition to the SEM 
iteration. The parameter is then updated as a weighted mean of both contributions, yielding: 

where < 7„ < 1. Of course, to apply SAEM, the standard EM needs to be tractable. 

The sequence 7„ is typically chosen so as to decrease from 1 to 0, in such a way that the 
algorithm is equivalent to SEM in the early iterations, and then becomes more similar to EM. 
It is established that SAEM converges almost surely towards a local likelihood maximizer (thus 
avoiding saddle points) under the assumption that 7„ decreases to with lim„_;.oo(7„/7„+i) = 1 
and J2n^n oo. 

4.4. MCEM 

Monte Carlo EM [30]. At least formally, MCEM turns out to be a generahzation of SEM. In 
the SEM simulation step, draw m independent samples Zn^ , Zn^ , . . . , zii^^ instead of just one, 
and then maximize the following function: 

^ m 

Q{0\On) = -Y,logp{zi^'^\O), 
3=1 

which, in general, converges almost surely to the standard EM auxiliary function thanks to the 
law of large numbers. 

Choosing a large value for m justifies calling this Monte Carlo something. In this case, 
Q may be seen as an empirical approximation of the standard EM auxiliary function, and the 
algorithm is expected to behave similarly to EM. On the other hand, choosing a small value 
for m is not forbidden, if not advised (in particular, for computational reasons). We notice that, 
for m = 1, MCEM reduces to SEM. A possible strategy consists of increasing progressively the 
parameter m, yielding a "simulated annealing" MCEM which is close in spirit to SAEM. 

4.5. SAEM2 

Stochastic Approximation EM [11]. Delyon et al propose a generalization of MCEM called 
SAEM, not to be confused with the earlier SAEM algorithm presented in section 4.3, although 
both algorithms promote a similar simulated annealing philosophy. In this version, the auxiliary 
function is defined recursively by averaging a Monte Carlo approximation with the auxiliary 
function computed in the previous step: 

m„ 

Qn{0) = (1 -7n)Qn-l(^) + ^ V logp(4-'') |0), 
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where Zn ,Zn , ■ ■ ■ , Zn" are drawn independently from p{z\y, On)- The weights 7^ are typically 
decreased across iterations in such a way that Qn{0) eventually stabilizes at some point. One 
may either increase the number of random draws mn, or set a constant value nin = 1 when sim- 
ulations have heavy computanional cost compared to the maximization step. The convergence 
of SAEM2 towards a local likelihood maximizer is proved in [11] under quite general conditions. 

Kuhn et al [16] further extend the technique to make it possible to perform the simulation 
under a distribution 1151^(2:) simpler to deal with than the posterior pdf p{z\y,9n)- Such a 
distribution may be defined as the transition probability of a Markov chain generated by a 
Metropolis-Hastings algorithm. If Tlgi^z) is such that its associated Markov chain converges 
to p{z\y,9), then the convergence properties of SAEM2 generalize under mild additional as- 
sumptions. 



5. Conclusion 

This report's primary goal is to give a flavor of the current state of the art on EM-type statistical 
learning procedures. We also hope it will help researchers and developers in finding the literature 
relevant to their current existential questions. For a more comprehensive overview, we advise 
some good tutorials that are found on the internet [8, 2, 1, 10, 29, 17]. 



A. Appendix 

A.l. Maximum likelihood quickie 

Let Y a random variable with pdf p{y\9)^ where 9 is an unknown parameter vector. Given an 
outcome y of y , maximum likelihood estimation consists of finding the value of 9 that maximizes 
the probability p{y\9) over a given search space G. In this context, p{y\9) is seen as a function 
of 9 and called the likelihood function. Since it is often more convenient to manipulate the 
logarithm of this expression, we will focus on the equivalent problem of maximizing the log- 
likelihood function: 

9{y) = argmaxL(y, 0) 

where the log-hkelihood L{y,9) = logp{y\9) is denoted L{y,9) to emphasize the dependance in 
y, contrary to the notation L{9) usually employed throghout this report. Whenever the log- 
likelihood is differentiable wrt 9, we also define the score function as the log-likelihood gradient: 

S{y,9) = —iy,9) 

In this fundamental result is that, for all vector U{y,9), we have: 



E(5C.) = ^E(t,')-E(^) 



where the expectation is taken wrt the distribution p{y\9). This equality is easily obtained from 
the logarithm differentiation formula and some additional manipulations. Assigning the "true" 
value of 9 in this expression leads to the following: 

• E{S) = 

• Cov(5, 5) = — E^-^j (Fisher information) 
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• If U{y) is an unbiased estimator of 6, then Cov(S', U) = Id. 

• In the case of a single parameter, the above result implies Var(t/) > Yar(s) f'^o™ 
Cauchy-Schwartz inequality, i.e. the Fisher information is a lower bound for the variance 
of U. Equality occurs iff U is an afiine function of S, which imposes a specific form to 
p{y\0) (Darmois theorem). 

A. 2. Jensen's inequality 

For any random variable X and any real continuous concave function /, we have: 

/[E(X)]>E[/(X)], 
If / is strictly concave, equality occurs iff X is deterministic. 
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